Problem

Typical facial expression recognition algorithms unpack videos, read the frames, and make predictions on each frame.

Additionally, many ML practitioners like to include predictions as captions on the video frames.

This method has two major disadvantages:

  1. An excessive number of predictions per second, which makes the captions hard to read.

  2. Frequent erroneous predictions. Human facial expressions can be complex, so given only a single video frame as input, it is understandable that the algorithm makes bad predictions.

Purpose

This project's purpose is to use feature engineering techniques to improve existing facial expression recognition algorithms. With these techniques, the output video will have:

  1. fewer erroneous predictions
  2. fewer predictions overall, which increases prediction caption readability.

Results

See the example in the following cell.

Allie (played by Rachel McAdams) was filled with both anger and sadness when she discovered her love for Noah (played by Ryan Gosling).

She was conflicted because she was engaged to another man, but found herself still in love with Noah.

Additionally, she was angry at this complex situation, and became enraged when Noah slandered her character.

The left screen shows predictions made by the deepface library without feature engineering techniques.

The right screen shows predictions made by the deepface library after applying feature engineering techniques.

Results:

  1. https://youtu.be/DXF8fFZGcFg
  2. https://youtu.be/GzK-TwfRjMY

As a fun exercise, you may also like to turn the video volume off and see if you can make similar predictions without audio cues.

Model explanation

In this section, I'd like to explain my model creation process step by step.

Afterwards, we shall explore the feature engineering technique's effectiveness in section 4:

(Model) vs. (Model after applying feature engineering techniques)

Import library

Reading the video & applying deepface
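As a minimal sketch of this step: the loop below runs an emotion predictor over every frame and collects one probability dict per frame. The predictor is pluggable so the sketch stays self-contained; `stub_predict` is a hypothetical stand-in, and the commented-out DeepFace wiring is an assumption (the `DeepFace.analyze` return format differs between deepface versions).

```python
def emotions_per_frame(frames, predict):
    """Run an emotion predictor on every frame, collecting one
    probability dict per frame, e.g. {'angry': 0.7, 'neutral': 0.3}."""
    return [predict(frame) for frame in frames]

def dominant_emotion(probs):
    """Return the emotion with the highest probability for one frame."""
    return max(probs, key=probs.get)

# With deepface and OpenCV installed, `predict` could wrap
# DeepFace.analyze (an assumption -- the return format differs
# between deepface versions):
#
#   from deepface import DeepFace
#   def predict(frame):
#       result = DeepFace.analyze(frame, actions=["emotion"],
#                                 enforce_detection=False)
#       return result[0]["emotion"]
#
# and `frames` could be read with cv2.VideoCapture(video_path).

# Hypothetical stub predictor so the sketch runs on its own:
stub_predict = lambda frame: {"angry": 0.9, "neutral": 0.1}
timeline = emotions_per_frame(range(3), stub_predict)
print([dominant_emotion(p) for p in timeline])  # → ['angry', 'angry', 'angry']
```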

Feature Engineering


There is a lot of noise in the predictions. Here are some methods to smooth or eliminate it.

  1. If a prediction has a low probability, drop it to 0.

The probability threshold is 50%. This threshold was discovered through an iterative process of tuning the threshold parameter and observing the prediction results. We could further validate this parameter by fitting the model on other videos.
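A minimal sketch of this thresholding step (the 50% cutoff is the value found above for this particular video, not a universal constant; the sample prediction is hypothetical):

```python
def drop_low_confidence(probs, threshold=0.5):
    """Zero out any emotion probability below the threshold."""
    return {emotion: (p if p >= threshold else 0.0)
            for emotion, p in probs.items()}

# One frame's (hypothetical) prediction before and after thresholding:
frame_pred = {"angry": 0.72, "sad": 0.18, "neutral": 0.10}
print(drop_low_confidence(frame_pred))
# → {'angry': 0.72, 'sad': 0.0, 'neutral': 0.0}
```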

Clustering

Recall we stated that single-frame predictions are often bad since the model sees only one frame of input. Given one frame, many complex expressions would be difficult even for a human to classify. However, we can leverage nearby predictions to increase model prediction accuracy.

For example, examine the consecutive frames below.

[Image: series of consecutive frames]

If the predictions are clustered together within the timeline, they are more likely to be true predictions, since the ML algo continuously predicts the same emotion.

If a case such as (frame0, frame1, frame2, frame3, frame4...etc) ---> (angry, angry, angry, neutral, neutral, angry, angry, angry, angry) occurs:

We can infer that the actor is supposed to deliver angry emotions in frames 3 & 4. Since one frame is 1/25th of a second, I don't think the actor purposely paused his or her delivery for 2/25ths of a second.

On the other hand, if we see (neutral, angry, neutral, sad, fear)...

It is more likely that the algo couldn't make an accurate prediction of the actor's delivery based only on single frame images.

There are a few 1D clustering techniques:

  1. Jenks natural breaks optimization

  2. KDE & splitting into clusters via local minima
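For reference, option 2 can be sketched in pure Python: sum a Gaussian kernel over the frame indices where one emotion fires, then cut the timeline at local minima of the density. This is a hedged illustration (libraries such as scipy or scikit-learn provide production KDE implementations; the bandwidth here is an arbitrary assumption):

```python
import math

def kde(points, xs, bandwidth=2.0):
    """Unnormalized Gaussian kernel density estimate at each x in xs."""
    def density(x):
        return sum(math.exp(-((x - p) / bandwidth) ** 2 / 2) for p in points)
    return [density(x) for x in xs]

def split_at_local_minima(points, bandwidth=2.0):
    """Cluster sorted 1D points by cutting at local minima of the KDE."""
    points = sorted(points)
    xs = list(range(points[0], points[-1] + 1))
    d = kde(points, xs, bandwidth)
    # interior local minima of the density curve become cut positions
    cuts = [xs[i] for i in range(1, len(d) - 1)
            if d[i] < d[i - 1] and d[i] < d[i + 1]]
    clusters, current = [], []
    for p in points:
        while cuts and p > cuts[0]:
            cuts.pop(0)
            if current:
                clusters.append(current)
                current = []
        current.append(p)
    return clusters + ([current] if current else [])

# frame indices where one emotion was predicted: two separate bursts
frames = [0, 1, 2, 3, 20, 21, 22]
print(split_at_local_minima(frames))  # → [[0, 1, 2, 3], [20, 21, 22]]
```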

However, I think a simple customized clustering approach will work well.

Recall from the last text box that we already have a decent intuition for clustering the predictions via some simple rules that control the number of points in a cluster.

Without context clues, ML can sometimes make wrong predictions based on only one frame. To combat this, we are going to assume that any visible emotion lasts at least 1/5 of a second. Since this video's FPS is 25, a visible emotion must last at least 5 frames to qualify for our visualization.
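The two rules above can be sketched as a run-length filter: first relabel short interruptions sandwiched between identical emotions (the frame 3 & 4 case), then blank out any run still shorter than 5 frames. The function and variable names are my own illustration, not the project's actual code.

```python
from itertools import groupby

MIN_RUN = 5  # at 25 FPS, 5 frames ≈ 1/5 s, the assumed minimum visible emotion

def runs(labels):
    """Run-length encode, e.g. ['a','a','b'] -> [('a', 2), ('b', 1)]."""
    return [(k, len(list(g))) for k, g in groupby(labels)]

def smooth(labels, min_run=MIN_RUN):
    """Fill short interruptions between identical emotions, then drop
    (set to None) any run still shorter than min_run frames."""
    r = runs(labels)
    # pass 1: a short run between two identical neighbors takes their label
    for i in range(1, len(r) - 1):
        label, n = r[i]
        if n < min_run and r[i - 1][0] == r[i + 1][0]:
            r[i] = (r[i - 1][0], n)
    # re-encode so merged neighbors count as one long run
    flat = [lab for lab, n in r for _ in range(n)]
    r = runs(flat)
    # pass 2: blank out runs that are still too short to be visible
    return [lab if n >= min_run else None for lab, n in r for _ in range(n)]

# the example from the text: a 2-frame 'neutral' blip inside an angry delivery
frames = ["angry"] * 3 + ["neutral"] * 2 + ["angry"] * 4
print(smooth(frames))  # → all nine frames relabelled 'angry'
```

On a scattered sequence like (neutral, angry, neutral, sad, fear), every run is shorter than 5 frames, so the filter blanks the whole stretch, matching the intuition that the algo couldn't settle on a prediction.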

Comparing existing model vs. model (after feature engineering)

Typical Method

Typical facial expression recognition algos predict emotions frame by frame.

While this method is useful on some videos, it can be inappropriate for emotional videos.

Videos where actors/actresses express a large range of emotions can cause the predictions to fluctuate frequently.

The result? A series of predictions that look like flickering lights!

ex1: https://youtu.be/UHdrxHPRBng?t=1356

ex2: https://youtu.be/4aEewKHQ3Eg

ex3: https://youtu.be/sTNMLLWnG1U?t=153

(Notice the model predicts the YouTuber to be 'sad' while he blinks his eyes)

My Method

Examining additional examples

Let's check out some more examples!